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Abstract —We present MLEHapIo, a maximum likelihood de novo assembly algorithm for reconstructing viral haplotypes in a virus 
population from paired-end next generation sequencing (NGS) data. Using the pairing information of reads in our proposed Viral Path 
Reconstruction Algorithm (ViPRA), we generate a small subset of paths from a De Bruijn graph of reads that serve as candidate paths 
for true viral haplotypes. Our proposed method MLEHapIo then generates a maximum likelihood estimate of the viral population using 
the paths reconstructed by ViPRA. We evaluate and compare MLEHapIo on simulated datasets of 1200 base pairs at different 
sequence coverage, on HCV strains with sequencing errors, and on a lab mixture of five HIV-1 strains. MLEHapIo reconstructs full 
length viral haplotypes having a 100% sequence identity to the true viral haplotypes in most of the small genome simulated viral 
populations at 250x sequencing coverage. While reference based methods either under-estimate or over-estimate the viral haplotypes, 
MLEHapIo limits the over-estimation to 3 times the size of true viral haplotypes, reconstructs the full phylogeny in the HCV to greater 
than 99% sequencing identity and captures more sequencing variation for the HIV-1 strains dataset compared to their known 
consensus sequences. 

Index Terms —de novo assembly, maximum likelihood estimation, viral haplotype reconstruction, complete phylogeny, backward 
elimination, paired-end reads de novo assembly. 
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1 Introduction 

V IRUSES replicating within a host exist as a collection of 
closely related genetic variants known as viral haplo¬ 
types. The diversity in a viral population, or quasispecies, 
is due to mutations (insertions, deletions or substitutions) 
or recombination events that occur during virus replication. 
These haplotypes differ in relative frequencies and together 
play an important role in the fitness and evolution of the 
viral population 0, 0, 0, 0. 

Viral population diversity was traditionally studied by 
Sanger sequencing of Polymerase Chain Reaction (PCR) 
products. This approach has inherent problems that the 
sample diversity is inadequately assessed, errors are in¬ 
troduced by PCR, and the complete genome of the virus 
often is not analyzed. The next generation sequencing (NGS) 
technologies, on the other hand, can sample millions of 
short reads from the complete viral population. However, 
they introduce a new challenge of assembling an unknown 
number of original haplotypes from the short reads. The 
presence of sequencing errors in short reads in particu¬ 
lar confounds the true mutational spectrum in the viral 
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population because viral error rates and sequencing error 
rates are similar. Several algorithms have been proposed 
for reconstructing haplotypes in a population using a high- 
quality alignment of short reads to a reference viral genome 
(See 0 for review). Reference based reconstruction of viral 
haplotypes can be performed either locally or globally. Local 
haplotype estimation involves inferring haplotypes in short 
segments along a viral reference sequence to which the reads 
are aligned |61, 0. Error correction is generally performed 
before |8|, |9|, |10| , (TT] or along with local haplotype 
estimation 0, 0, |12l|, [ [131 . Global estimation involves 
estimating viral haplotypes over segments larger than the 
length of reads. For global estimation, the reads aligned to a 
reference or to a consensus sequence are generally arranged 
in a read-graph, where vertices in the graph are reads 
and edges denote overlaps amongst the aligned reads. As 
the number of haplotypes in a population is unknown, an 
optimal set of viral haplotypes that explains the read-graph 
is obtained. There are a number of optimization frameworks 
for analyzing the read-graph (14), for example by modeling 
it as a network flow problem |15|, (^, minimal cover for¬ 
mulation, and as a maximum bandwidth paths formulation 
(Tt) . Probabilistic models have also been explored for local 
haplotype estimation 0, 0, global haplotype estimation 
| |10| , (Tt) , and for analyzing recombinations amongst viral 
haplotypes | p^ |. 

The reference based methods rely on alignment of reads 
to a reference sequence for error correction, orientation of 
reads, and for reconstruction of the viral population. A refer¬ 
ence sequence can be helpful when it is representative of all 
haplotypes present in the viral population. However, due to 
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the presence of insertions and deletions, recombination and 
high mutation rates in some viral populations (e.g. RNA 
viruses), a large percentage of the reads are unaligned to the 
reference genome, and are ignored while estimating viral 
diversity. Additionally, the reference based algorithms are 
known to have high false-positive rates in reconstruction 
of viral haplotypes and over-estimate the number of viral 
haplotypes in datasets with low sequencing diversity | [T8| , 
(T^ , which is expected in a virus population obtained from 
an infected individual. 

De novo methods for assembling viral haplotypes pro¬ 
vide an alternative to reference-based haplotype estimation, 
where the viral haplotypes are reconstructed directly from 
the sampled reads. There are two broad types of de novo 
assembly methods: overlap-layout-consensus and De Bruijn 
graph based methods. These methods have been widely 
used for assembly of reads from a single genome. For viral 
populations, overlap-layout-consensus based de novo assem¬ 
bly methods have been used to generate a consensus viral 
sequence pO) . ITere, similar short reads are aligned to each 
other to generate a multiple sequence alignment (MSA) of 
the reads. A consensus viral sequence is then obtained by a 
base-by-base majority voting from the MSAs. TTowever, this 
method assembles consensus sequences of highly diverse 
viral populations and is not suitable if the aim is to inves¬ 
tigate the diversity of the viral population. De Bruijn graph 
based de novo methods first break the short reads into k- 
mers which form vertices in the graph and edges are drawn 
in the graph between overlapping fc-mers observed in the 
short reads. For assembly of reads into a single genome, a 
single path in the De Bruijn graph that visits every vertex 
can be used to represent the genome sequence. TTowever, 
reconstructing the viral haplotypes in a population from the 
De Bruijn graph would involve simultaneously assembling 
multiple paths that collectively visit every vertex (a graph 
cover) and each path represents a single viral haplotype 
in the population. The problem of simultaneous assembly 
from short reads in a single individual is computationally 
intensive and challenging even for estimating the two paths 
corresponding to diploid strains in the individual (^ . 

With the availability of paired-end sequencing data, 
methods that explicitly incorporate paired end read infor¬ 
mation in an aligned read graph for reconstructing viral 
haplotypes have been explored ||22|, p3) , p^ |. TTere, instead 
of computing a minimal set of haplotypes that explains 
all the paired-reads, the method iteratively merges all the 
fully connected clusters of reads (max-cliques) in the read- 
graph to generate viral haplotypes. The method, however, 
has exponential time complexity in the read coverage (22| . 
On the other hand, the problem of reconstructing a minimal 
set of haplotypes under the constraints of paired-end reads 
from a read-graph or a De Bruijn graph is NP-hard (25) , 
Thus, only heuristic algorithms for estimating viral 
haplotypes from paired-end sequencing reads are possible 

m- 

We present a maximum likelihood based de novo assem¬ 
bly algorithm for estimating viral haplotypes using paired- 
end sequencing data. The main advances made in this paper 
are (i) we utilize the pairing information of the paired 
reads to compute a score for paths in a De Bruijn graph 
that is generated from the overlaps in the reads, (ii) we 


develop a novel polynomial time heuristic algorithm. Viral 
Path Reconstruction Algorithm (ViPRA), that generates M— 
paths corresponding to top M scores through every vertex 
in the De Bruijn graph, and (ii) we utilize an algorithm 
MLETTaplo that provides a maximum likelihood estimate 
of the viral population based on the proposed generative 
model. 

We extensively evaluate our algorithm on simulated 
datasets varying over sequence diversity, genome lengths, 
coverage, and presence of sequencing errors. As a repli¬ 
cating viral population consists of viral haplotypes that 
are derived from a recent common ancestor, our simulated 
datasets are modeled using a Bayesian coalescent simula¬ 
tor p7) , (^ which generates sequences derived from a 
realistic evolutionary model of divergence. Our goal is to 
successfully reduce the number of paths while retaining the 
true haplotypes and to recover the viral population accu¬ 
rately and with high precision. Our results demonstrate that 
MLETTaplo is able to retain the phylogenetic signature in all 
simulated datasets where there is phylogenetic support for 
all haplotypes in the population, and it generates full length 
or near full length haplotypes. We compare our results to 
existing reference and de novo assembly based methods, 
and observe that MLETTaplo does not suffer from a high 
false positive rate and a bias towards the reference sequence 
which occurs in the reference based methods. 

2 Methods 
2.1 Definitions 

A viral population H is a collection of viral haplotypes 
{Hi, H 2 ,..., Hp{, where each haplotype Hi is a string of 
length GSi defined over the alphabet S = {A,G,C,T}. The 
sequence similarity between any two haplotypes Hi, Hj G 
H is assumed greater than 90% as they are all generated 
via mutations and recombinations on a recent common 
ancestral sequence. A paired read {Rf,Rj.) is a pair of 
two strings over the alphabet S that are sequenced using 
NGS technologies from an IS length segment of haplo¬ 
type Hi G H, i G {1,2,..., P}. The collection of paired 
reads sampled from the viral population H is denoted as 

The statistics of 

insert length IS for the NGS reads are known, although the 
insert length for a particular paired read is unknown. The 
average number of times a segment of the viral haplotype is 
sequenced is known as the coverage of the NGS sequencing. 

A substring u of length k from Hi, Rf, or Rr is known 
as a fc-mer and the reverse complement of the fc-mer u is 
denoted as u. The number of times a /c-mer u is observed in 
the sampled reads is known as count of the fc-mer u. The two 
fc-mers spanning a (fc -F 1) length string in a read are known 
as consecutive fc-mers. A De Bruijn graph is a representation 
of the sampled reads and consists of fc-mers as its vertices. 
Directed edges are drawn in the graph between consecutive 
fc-mers from the first to the second fc-mer. The vertices 
representing consecutive fc-mers are known as consecutive 
vertices, and a sequence of consecutive vertices in the graph 
is known as a path. A path starting at a vertex with no edges 
into it {source vertex) to a vertex with no edges going out of 
it {sink vertex) is known as a source-sink path in the graph. 


2.2 Pre-processing of reads: Error Correction, De 
Bruijn Graph Construction and Paired Constraints Set 

2.2.1 Error Correction 

In this paper, error-free paired reads are simulated first for 
calibration of the software. For NGS simulated reads and 
reads from real sequencing studies, error correction software 
BLESS IM) was used for error correction. As this software 
has been developed for reads sequenced from a single 
genome, we perform additional error-correction when re¬ 
constructing paths in the De Bruijn graph. A path that either 
begins or ends in vertices corresponding to fc-mers that have 
Ic-mer counts less than the threshold for BLESS correction 
(also known as tips in the De Bruijn graph) are removed. 
These vertices typically correspond to sequencing errors, 
and this technique has parallels in existing softwares for 
assembly of single genomes [ |3^ . 

2.2.2 De Bruijn Graph 

The fc-mers from the error corrected paired reads are rep¬ 
resented in a De Bruijn graph. As the orientation of the 
sequenced reads are unknown, fc-mers from the paired reads 
and their reverse complements are stored in the graph. In 
order to reduce storage space, the De Bruijn graph G is con¬ 
verted into a condensed graph Gc in which linear chains of 
vertices are condensed into a single vertex, while preserving 
the edge relationships in the graph G. This technique has 
also been used in the assembler SPADES for single genomes 

Eor viral populations, specially for RNA viruses, it is 
possible to obtain an acyclic De Bruijn graph for reasonable 
values of fc (around 50-60) as the length of repeats in 
the viral genomes are typically small (except for terminal 
repeats in some viral genomes). Two properties of a directed 
acyclic De Bruijn graph G or the condensed graph Gc 
are particularly useful for reconstructing sequence of viral 
haplotypes. Eirst, the insert size IS of a particular paired 
read can be computed using the distance d{u, v) between 
two vertices u and v corresponding to the beginning and 
end of the paired read. This distance d{u,v) is defined as 
the length of strings spelled by the shortest u — v path in 
the graph, which can be computed uniquely for a directed 
acyclic graph. Second, a haplotype of the viral population 
is spelled by a source-sink path in the graph G or Gc. A 
collection of such source-sink paths that spans all vertices (a 
path cover) represents the observed viral population. 

2.2.3 Paired constraints set (PS) 

A Paired Constraints Set PS is generated from the paired 
reads and contains a list all pairs of fc-mers observed in 
the paired reads along with the number of times a fc-mer 
pair is observed. The set PS for graph Gc can be computed 
once in time linear in the number of reads and quadratic 
in the length of the reads. As there are millions of source- 
sink paths in a typical graph for viral populations, the 
pairing information of the paired reads is used to guide 
their selection. The elements of the paired set PS indicate 
whether a pair of fc-mers appear together in one of the viral 
haplotypes in the population. 
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2.3 Scoring source-sink paths in the graph using the 
set PS 

Consider a path P = {ui,U 2 ,U 3 ,... ,Ur) in the graph G, 
where mi is the source vertex, Ur the sink vertex. The vertices 
{ui,U 2 ,U 3 ,..., Ur) correspond to the fc-mers in the sampled 
reads. Assuming that all paired fc-mers from the viral pop¬ 
ulation separated by insert size distance IS are sampled in 
the observed reads, we define a score S{P) for the path P 
using the elements of the paired set PS as follows: 


S(P) 


1 

W)' 


{l[(r,s)eP5] 

(r,s)GPn [d{s) — d{r)<IS] 


pen ■ l[(r, s) ^ PS]} 


( 1 ) 


The score for the path P is defined over vertex-pairs 
{(r, s); r G P &c s G P} that are within distance IS on the 
path P. It is proportional to the number of vertex-pairs 
present in PS. The score is also penalized by a penalty 
term pen for every vertex-pair in P within IS that is not 
present in PS, as under assumption of high coverage, all 
pairs of fc-mers from true viral haplotypes within distance 
IS would have been observed in the paired reads. The score 
is normalized by E{P), the expected number of vertex-pairs 
in a path P that are within the insert size IS. 

Paths in the graph that have high scores are candidates 
for possible viral haplotypes (See supplementary text for 
rationale). Moreover, a path cover of the graph consisting 
of such paths is a candidate for the viral population H. 
Equation can be modified in the presence of sequencing 
errors, wherein, instead of membership in PS within a 
distance IS, the contribution of a vertex pair (m,d) G PS 
is weighted by its relative frequency of occurrence. 


2.4 ViPRA: A heuristic aigorithm for estimating top M 
paths per vertex 

We propose Viral Paths Reconstruction Algorithm (ViPRA), 
a heuristic algorithm that computes a path cover of the 
graph using high scoring M paths per source vertex in the 
graph G. ViPRA computes M high scoring paths through 
a vertex using precomputed M high scoring paths through 
the vertex's neighbors. It starts with the sink vertices in the 
graph and iteratively builds on them using memoization 
of M paths through each vertex to generate high scoring 
source-sink paths in the graph. This is possible as the score 
for a path P can be constructed using the scores of its 
sub-paths starting at a particular vertex to the end (See 
Algorithms 1, 2 in supplementary data). 

The goal of the scoring mechanism S{P) is to ensure 
that the paths corresponding to true haplotypes have a 
high score and thus they are retained in the top paths for 
a vertex as the algorithm propagates from the sink vertex 
to the source vertex. Thus, the choice of M is important as 
it would affect the number of paths generated per source 
vertex. Eor vertices that are not spanned by paths from the 
source vertices, additional M paths are generated through 
these vertices. 
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2.5 Maximum Likelihood Estimate of the viral popula¬ 
tion 

The path cover generated by ViPRA is an over-estimation of 
the possible paths that represent the viral population. We 
generate a maximum likelihood estimate of the population 
using these paths as a starting point. 

The generative model for sampling paired reads from the 
viral population H is as follows: For a paired-read {Rf,Rr) 
of insert length IS, it can be sampled from only a single 
location in the viral haplotype under the assumption that 
there are no long repeats in viral haplotypes. Thus, the 
probability of observing the paired read {Rf,Rr) from the 
viral population H can be expressed as the ratio of number 
of haplotypes that share such a read segment to the total 
number of locations from which any paired read of length 
IS can be sampled: 

In this equation, q is the number of haplotypes in H 
that have {Rf,Rr) as their sub-string, P is the number of 
haplotypes in H, and GS is the average length of the viral 
haplotypes. It should be noted that Equation]^ also holds if 
{Rf,Rr) denotes a /c-mer pair. 

Thus, for a collection of paired reads 
{ ('^l/ 5 ) 5 (f^2/ 5 -^2r ) 5 ■ ■ ■ 5 5 )} where (^Rij^Rir) 

is sampled Ci times, assuming independent sampling, the 
joint probability of observing the paired reads given the 
viral population H can be expressed as a multinomial 
expression: 


R\r) — Cl, ... , ^ Rnr] — — 

n 4 ‘ 


Cl! • C 2 !.. .c„ 


(3) 


i=l 


where, M = Yl7=i G = n. 

Given the above formulation, a maximum likelihood set 
of haplotypes H^i can be estimated using equation as 
follows: 


Hjnl — '^({^('^1/5 ^Ir') — Cl, . . . , ci^Rjif , Rjir) — 1 ^) 

VH 

(4) 


2.6 Backward Elimination for estimating Hmi 

The top M—paths through all the source vertices is used 
as candidates for possible haplotypes for computing the 
maximum likelihood set of haplotypes. The maximum like¬ 
lihood set of haplotypes for a given dataset is computed 
using backward elimination. The likelihood computation 
begins with all the top M—paths through the source vertices 
and iteratively remove one path from the set of all paths 
until the likelihood of the remaining paths in the set starts 
to decrease. The remaining set of haplotypes constitute a 
maximum likelihood estimate of the viral population. 


2.7 Simulation Data Generation 

Viral haplotypes in a viral population are related to one 
another due to mutations or exchange of genome segments 
that accrue during replication. Thus, the viral haplotypes 
share recent common ancestor sequences from which all the 
viral haplotypes have originated. In order to model this, 
we generate the simulated viral populations using Bayesian 
Serial SimCoal Simulator (BayeSSC) (^ , The simulator 
takes population parameters such as population size, muta¬ 
tion rates, and genome length as input and generates a set of 
sequences that have mutations with respect to a recent com¬ 
mon ancestor sequence. The error-free lllumina sequencing 
paired-reads are simulated from these viral haplotypes us¬ 
ing dwgsim (https://github.com/nhl3/DWGSlMl, which 
are used as input to evaluate our proposed method. 

3 Results 

De Bruijn graph based methods are amenable to address the 
problem of recovering viral haplotype diversity from shot¬ 
gun or amplicon sequence data if the challenges discussed 
above can be solved. Thus an overview of our approach is as 
follows. As sequencing errors increase the complexity in a 
De Bruijn graph, we first error correct the sequenced paired 
reads using existing error correction softwares and store the 
A:-mers from the error corrected reads in the De Bruijn graph 
G, where the vertices are fc-mers and consecutive fc-mers in a 
read form edges (Figure]^. The vertices in the graph with no 
incoming edges are known as source vertices, while vertices 
with no outgoing edges are known as sink vertices. The 
paths in the graph starting at a source vertex and ending in 
a sink vertex are candidates for viral haplotypes. In order to 
prune through the millions of source-sink paths in the graph, 
we store in a paired set the pairing information of fc-mers 
observed in the error corrected reads along with their num¬ 
bers of occurrence. Our proposed Viral Path Reconstruction 
Algorithm (ViPRA), a heuristic polynomial time algorithm, 
uses a parameter M and the evidence from the paired set 
to retain a small number of source-sink paths as candidate 
haplotypes. We limit the over-estimation of number of hap¬ 
lotypes in the viral population by our proposed maximum 
likelihood estimator, MLEHaplo. MLEHaplo estimates the 
likelihood that the reads obtained are derived from a set 
of full-length sequences in the viral population using the 
paths reconstructed by ViPRA as putative candidates for 
these sequences. 

3.1 Simulation studies for partial genome reconstruc¬ 
tion of viral populations 

Our first evaluation is on simulated datasets to test ViPRA's 
ability to reduce the total number of paths for possible 
viral haplotypes and to retain the paths for true haplotypes. 
The goal of MLEHaplo is to accurately reconstruct the 
true haplotypes from the output of ViPRA, while limiting 
the over-estimation of the viral population. The length of 
the simulated haplotypes are 1200 base pairs (bps), which 
resembles the length of a gene or a segment in commonly 
known viruses used for evolutionary studies. Error-free 
paired end reads are simulated from viral populations 
that were generated from a common ancestral sequence 
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A. Viral 
Population 


B. Sequencing 



C. Error 
Correction 



D. De Bruijn 
Graph and 

paired set i = = | = = | = = i==i==i= = 


« 

E. ViPRA — - - 


F. MLEHapIo 


Fig. 1. Reconstruction of viral haplotypes using PE data: (A) A viral 
population consisting of four viral haplotypes (red, green, dark blue, blue 
lines) which mutations (red dots) and a recombinant haplotype (red- 
green line) are depicted here. (B) Sequencing technology generates 
paired-reads with sequencing errors (black dots on short segments from 
reads). (C) Error correction using existing softwares generates error-free 
reads (D) The error-corrected reads are stored in a De Bruijn graph. The 
pairing information of a paired read constitutes all fc-mer pairs observed 
in it, and it is stored in the set PS. Colors denote fc-mers obtained from 
the corresponding viral haplotypes in (A). (E) The proposed heuristic 
algorithm ViPRA reconstructs only a fraction of the total number of paths 
in the De Bruijn graph, where paths reconstructed have a high score that 
measures the support of fc-mer pairs from a path found in the set PS. 
(F) MLEFIapIo reconstructs a maximum likelihood estimate of the viral 
population, where the sampled reads are modeled to be sampled from a 
viral population consisting of paths generated by ViPRA as the starting 
point. 


reads from each dataset (Dl-DlO) are stored in De Bruijn 
graphs with fc-mer size of 60, while pairs of fc-mers are 
stored in the paired set. Although the ten datasets are 
generated under the same evolutionary parameters, there is 
a 1000-fold difference in the total number of source-sink paths 
present in the error-free De Bruijn graphs for these datasets 
(Table [^. This indicates that the complexity of a graph 
for a viral population varies significantly even when there 
are no sequencing errors and under the same evolutionary 
parameters. 


3.1.1 Evaluation of ViPRA with varying M 

Our algorithm ViPRA prunes through millions of paths in 
the graph to generate a set of paths having a high paired- 
end support score, as explained below, that form candidates 
for viral haplotypes. It takes as input a parameter M, the 
De Bruijn graph G, and the paired set PS for a given 
sequencing data. The output of ViPRA is a path cover of 
the graph containing M source-sink paths for each source 
vertex in the graph G. The M paths per source vertex are 
ranked based on their support present in the paired set PS. 
A negative penalty is applied to a path when there is no 
support found for its fc-mer pairs within insert size IS in the 
set PS and we only consider paths with positive scores for 
evaluation. Additional M paths are also generated through 
vertices not visited by any of the top M paths generated 
from the source vertices to obtain the path cover of the graph. 

We assess the number of paths reconstructed by ViPRA 
with the variation in parameter M. The number of paths 
recovered is independent of the total number of source-sink 
paths in G for a given dataset but is directly proportional 
to the parameter M and overall complexity of the graph G 
(Figure]^ Table [^. Paths for full length and partial length 
haplotypes are obtained for M = (5,10), while full length 
paths are obtained for M > 15 (Figure]^ indicating that not 
all vertices were visited by the source-sink paths for small 
values of M. We also evaluate the ratio of the true haplo¬ 
types recovered by ViPRA to the number of true haplotypes 
present in the viral population, or recall. There is 100% 
recall of the seven true haplotypes in all the ten datasets 
for M > 10, and in nine out of ten datasets for M = 5. This 
suggests that M = 10 is sufficient for recovering the true 
haplotypes for datasets Dl-DlO. The paths reconstructed at 
M = 10 are used by MLEHapIo for the maximum likelihood 
estimation. 


using a coalescent approach |j^, |[^| at a mutation rate 
of 10“® per generation per nucleotide and a population 
size of 5000, which are typical evolutionary parameters for 
a replicating virus population ||^, |^, The average 
sequencing depth of the 150 bp paired reads is 250x and 
average insert size is 230 bp (standard deviation of 75 bp). 
Ten viral populations (denoted as Dl-DlO) were generated 
using a different seed sequence under the same evolutionary 
parameters, each of them containing seven phylogenetically 
related haplotypes. 

In order to reconstruct the viral haplotypes, the paired 


3.1.2 Evaluation of MLEHapIo 

MLEHapIo generates a maximum likelihood estimate Hj„i 
of the viral population using the paths reconstructed by 
ViPRA. It reduces the number of paths from ViPRA by 
iteratively removing one haplotype by backward elimina¬ 
tion. The shape of the likelihood surface indicates that the 
likelihood is maximized when the set H^i contains 7—10 
paths which includes the true number of viral haplotypes in 
datasets Dl-DlO (Figure SI). MLEHapIo has 100% recall in 
6 out of 10 datasets, and recovers six out of seven true viral 
haplotypes in the remaining datasets with an exact sequence 
match to the true haplotypes (Table [^. 
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Fig. 2. Evaluation of ViPRA and MLEHapIo on error-free data. (A) A boxplot of the number of paths obtained from ViPRA when M ranges 
between 5-35 for datasets D1-D10 (B) A boxpiot of iength distribution of paths generated by ViPRA, presented as a fraction of the totai iength 
of 1200bp, for datasets D1-D10. Aimost aii the paths are fuii iength for M > 15. (C-F) Performance of ViPRA and MLEFIapio with variation in 
sequencing coverage, at 25x, SOx, lOOx, 200x, and 400x denoted on the x-axis of each figure, for datasets D6 and DIO. (C) Boxpiot of the totai 
number of paths recovered by ViPRA in datasets D6 (biue) and DIO (green). At SOx or greater coverages, ViPRA reconstructs 344 paths for DIO 
(shown as a iine at 344). (D) Boxpiot of recaii of true hapiotypes for datasets D6 (biue) and DIO (second coiumn at each coverage) in paths 
recovered by ViPRA. The recaii of ViPRA is 100% at aii sequencing coverages for DIO, thus appearing as a iine at 1. (E) Boxpiot of number of 
paths in the set Hmi reconstructed by MLEHapio in datasets D6 (biue) and DIO (green). (F) Boxpiot of MLEFIapio’s recaii for true hapiotypes in 
datasets D6 (blue) and DIO (green). MLEFIapio’s recall for DIO is > 90% at all coverages, the recall for D6 increases with increasing coverage. 
Recall is defined as the fraction of the number of true hapiotypes that have a reconstructed path with exact sequence match. 


3.1.3 Phylogenetic content of the maximum likelihood esti¬ 
mate 

We compare the phylogeny of the true hapiotypes to the 
set of reconstructed hapiotypes Hmi for those datasets with 
less than 100% recall. A bootstrap neighbor joining tree 


is generated using the predicted set H^i and true set of 
viral hapiotypes for datasets Dl, D2, D6, and D8 (Figure 
S2). In these four datasets, six of the seven true hapiotypes 
have a predicted haplotype with exact sequence match. For 
the seventh true haplotype, the set H^i contains predicted 
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haplotypes (hollow green circles) that cluster with the true 
haplotype (hollow red circles) in the phylogenetic tree with 
near zero branch lengths. The data also demonstrates that 
MLEHaplo recovers haplotypes that recapitulate the phy- 
logeny of the true viral population even when haplotypes 
are closely related to one another, as evidenced by the 
shorter branch lengths in D7 and DIO (Figure S3). Thus, 
even when MLEHaplo overestimates the number of actual 
haplotypes in a population, the phylogenetic information 
for the true set is retained. 

3.1.4 Effect of variation in coverage 

We assess the performance of ViPRA and MLEHaplo at 
varying sequencing coverages, as the recovery of low fre¬ 
quency variants may be affected in the presence of dominant 
viral strains in a population. Error-free Illumina sequencing 
paired reads of length 150 bp each with an average insert 
size of 430 bp (standard deviation 75 bp) are simulated from 
datasets D6 and DIO at coverages of 25x, 50x, lOOx, 200x, 
and 400x. The insert size chosen here is close to the average 
insert size for newer Illumina sequencing technologies. We 
compare datasets D6 and DIO because they represent viral 
populations containing haplotypes with short leaf branch 
lengths and long leaf branch lengths and have a 1000-fold 
difference in the total number of source-sink paths (Table 
Figure S2, Figure (A), (B) in S6). Thus viral haplotypes in 
dataset D6 are more closely related than those in dataset 
DIO. At each sequencing depth, we simulated five replicates 
of paired end Illumina sequencing from D6 and DIO and 
report the average performance over these replicates. 

As the performance of ViPRA at low sequencing cov¬ 
erage is unknown, the parameter M is relaxed from its 
sufficient value of 10 demonstrated above for Dl-DlO and 
set to M = 35. For the more diverse viral population DIO, 
ViPRA reconstructs 344 paths and this number is indepen¬ 
dent of sequencing coverage. The recall of true haplotypes 
is 100% in DIO at all coverages, indicating that even 25x 
coverage is sufficient for ViPRA to reconstruct all the true 
haplotypes in diverse viral populations (Fig. 2). In low 
complexity dataset D6, the number of reconstructed paths 
decreases as coverage increases, and converges to around 
200 at coverages greater than lOOx (Fig. 2). The increase 
in number of paths reconstructed by ViPRA at coverages 
less than lOOx is due to reconstruction of both partial length 
and full length paths in D6 (Figure S4). For low diversity 
dataset D6, the median recall is close to 90% at 25x coverage 
and increases to a median value of 100% at 200x and 400x 
coverage. 

MLEHaplo generates the set Hmi containing a median 
value of around 20 haplotypes for dataset DIO at all se¬ 
quencing coverages, while the number for D6 decreases 
from 41 at 25x coverage to 26 at 400x sequencing coverage 
(Fig. 2), indicating that performance in low diversity pop¬ 
ulations (similar to dataset D6) improves with increasing 
coverage. The decrease in number of paths at convergence 
in D6 can be attributed to the increase in support for source- 
sink paths in the set PS. The set Hmi does over-estimate the 
number of true viral haplotypes by 3 — 6 times the size of 
the viral population, although the over-estimation decreases 
for D6 with increasing coverage. The median value of recall 
of the true haplotypes increases from 45% at 25x coverage 


to 100% at coverages greater than 200x in dataset D6. For 
dataset DIO, MLEHaplo has more than 90% recall at each 
coverage (Fig. 2). Thus, 200x coverage is sufficient to achieve 
a median value of recall of 100% in both datasets. 

3.2 Comparison to existing methods 

We compare the haplotypes predicted by MLEHaplo on 
datasets Dl-DlO (Results in Table to those obtained 
from the softwares ShoRAH j^, QuasiRecomb fl^ , and 
PredictHaplo j7|. The consensus sequence obtained by the 
de novo assembler Vicuna was used as the reference 
sequence for each method as they require a reference se¬ 
quence. Through personal communication, the correspond¬ 
ing authors of the softwares VGA p3) and HaploClique 
||^ indicated that their softwares were not being actively 
developed, thus were not included in comparison. ShoRAH 
and QuasiRecomb over-estimate the number of predicted 
haplotypes, while PredictHaplo under-estimates the num¬ 
ber of viral haplotypes in eight out of ten datasets (Table [^. 
QuasiRecomb predicts greater than 1900 haplotypes for all 
datasets that only contain seven true haplotypes, although 
a large number of haplotypes are reported as low frequency 
variants that have relative population frequencies less than 
5 • 10“"^. None of these methods have 100% recall for any of 
the ten data sets and all fail to recover any true haplotypes 
in at least one of the datasets. In contrast, MLEHaplo pre¬ 
dicts 6-10 haplotypes for all the ten datasets and correctly 
reconstructs 6 — 7 out of seven true haplotypes in all the ten 
datasets. Thus, MLEHaplo retains the correct sequences of 
the haplotypes while accurately predicting the number of 
haplotypes in each viral population. 

3.3 Performance on HCV populations under the pres¬ 
ence of sequencing errors 

Our simulated data were generated under a realistic evo¬ 
lutionary model but fail to capture the constraints on viral 
sequences that results in a non-random distribution of sub¬ 
stitutions in the viral population. Keeping this in mind, we 
assess the performance of ViPRA and MLEHaplo on HCV 
populations. The E1/E2 genes (length 1672 bp) from ten 
HCV strains observed in patient C from a previous study 
are used as our viral population | j3^ , where the distribution 
of conserved and variable sites in the E1/E2 genes is not 
uniform across the length of the genome (Figure S5). The 
strains are named IC-IOC, out of which strains 2C and 9C 
are distantly related to the other eight strains (Figure S6 
(C)). Strains 3C, 4C, and IOC are either identical or differ 
from each other by 1 bp and have near-zero branch length 
when constructing a neighbor joining phylogenetic tree of 
the ten strains (Figures 3(A) and 3(B)). Thus, all methods 
are assessed for the recovery of eight E1/E2 genes in the 
HCV strains. 

A dataset HCV-U is generated by simulating paired 
reads from all the ten strains at equal frequencies using the 
software SimSeq p^ |. Another dataset is generated from the 
same HCV strains where the relative abundances follows 
power law with factor 2, denoted as dataset HCV-P. The 
average sequencing coverage is around 500x with insert 
sizes 300 bp and read lengths 150 bps. The reads in the 
datasets HCV-U and HCV-P also include sequencing errors 


to determine how it affects the performance of ViPRA and 
MLEHaplo. The software BLESS (M) was used for error 
correction and the error corrected reads are represented in 
the De Bruijn graph. In addition to the error correction from 
BLESS, the source-sink paths in the graph that terminate in 
vertices with fc-mer counts less than the threshold in BLESS 
(tips in the De Bruijn graph (35) ) are ignored when paths are 
reconstructed by ViPRA. 

As sequencing errors inflate the number of haplotypes 
generated by each method, we report the number of distinct 
paths generated by a method that are different from each 
other by more than 10 bp, or greater than 99% sequence 
difference. All methods except PredictHaplo overestimate 
the number of haplotypes in the viral population (numbers 
in brackets Table]^. MLEHaplo and ShoRAH predict similar 
number of distinct paths, while QuasiRecomb overestimates 
the distinct paths in the HCV-U dataset. The paths generated 
by MLEHaplo cluster well to generate a small number of 
distinct paths for both the HCV-U and HCV-P datasets 
(Eigure S7). The distribution of all pairwise distances for 
the predicted haplotypes indicates that while MLEHaplo, 
PredictHaplo, and ShoRAH have similar distributions to the 
true HCV strains, all of the paths generated by QuasiRe¬ 
comb have small pairwise distances and the distribution of 
pairwise distances differs from that of the true HCV strains 
(Eigure S8), which leads to small number of distinct paths 
in QuasiRecomb. We also compute the recall of the eight 
true E1/E2 HCV strains by each method to within 10 bp 
of sequencing difference. MLEHaplo has the highest recall 
amongst all methods that does not change on the HCV-U 
and HCV-P datasets, while the recall of other methods de¬ 
creases in the HCV-P dataset (Table [^. All methods reduce 
the length of the reconstructed paths, as the terminal ends 
of the strains have low sequencing coverage and are not 
reconstructed accurately. 

We assess the quality of the reconstructed haplotypes 
from each method by constructing a phytogeny of the re¬ 
constructed and true haplotypes (Eigure [^. As mentioned 
above, the reconstructions of strains 3C, 4C, and IOC are 
considered together as they differ by 1 bp or less. In the 
HCV-U dataset, MLEHaplo reconstructs a haplotype within 
10 bp sequence difference for six of the seven distinct strains 
and also one haplotype for the strains 3C, 4C, and IOC. 
Other methods fail to reconstruct haplotypes for one or 
the other strain in both HCV-U and HCV-P datasets, with 
PredictHaplo and QuasiRecomb only recovering three out 
of the eight strains in the HCV-P dataset. The haplotype 
6C is recovered by all methods in both HCV-U and HCV-P 
datasets. All strains except 8C are reconstructed by MLE¬ 
Haplo in the HCV-U and HCV-P datasets. Por 8C strain, 
MLEHaplo predicts a haplotype that differs from the true 
strain by 15 bp in both the HCV-U and HCV-P datasets. The 
performance of MLEHaplo remains relatively unchanged 
under the power law distribution of strains in the HCV-P 
datasets, whereas other methods recover fewer haplotypes. 
Thus, MLEHaplo recovers the full phylogeny of the viral 
population as before under the uniform and power law 
distributions of viral strains. 


3.4 Evaluation on real sequencing data 

We next evaluate the performance of ViPRA and MLE¬ 
Haplo on a real sequencing data, where there are additional 
sources of errors during sample preparation and reverse 
transcription of viruses. A laboratory mixture of five known 
HIV-1 strains was sequenced using three different sequenc¬ 
ing technologies in a previous study (^, where the consen¬ 
sus sequence of each strain was also obtained (see details in 
(3^). The strains were named as YU2, NL43, HXB2, JRCSP, 
and 89.6. In their study, the HIV-1 strains were broken into 
five overlapping segments of average length 2500 bps that 
were separately shotgun sequenced using Illumina paired 
end sequencing. We used Illumina paired end sequencing 
from the last segment corresponding to the env gene for 
evaluating our method. 

As the Illumina sequencing data for the five HIV-1 
strains was obtained by transfection of 293T cells followed 
by reverse transcription of the RNA viruses p^ , we expect 
to see a large number of sequencing differences between 
the consensus sequences of the viral strains and the recon¬ 
structed viruses due to the inherent replication errors of 
the reverse transcription step. ViPRA generates 262 paths of 
average lengths 1500 bp from the error corrected env region 
reads and MLEHaplo reduces it to 78 paths. These paths 
cover the 2300 bp of the envelope gene in two overlapping 
1500 bp segments that can be joined together to reconstruct 
the full sequence of the envelope gene, as is seen in the 
tablet software alignments of the paths to the five HIV-1 
strains (^ (Eigure S9). 

We measure whether the reconstructed haplotypes are 
phylogenetically related to the five HIV-1 strains. We 
use a known HIV-1 strain subtype A (Accession number 
AB253635) as an outgroup to generate a rooted neighbor 
joining phylogeny of the five HIV-1 strains and report the 
sequence the reconstructed haplotypes that are greater than 
97% of sequencing difference to the consensus sequences 
of HIV-1 strains (Eigure |^. As the sequenced reads are 
generated from replicating viruses, none of the methods 
reconstruct haplotypes with 100% sequence identity to env 
genes. MLEHaplo reconstructs paths for the strains HXB2 
and 89.6 that have greater than 99% sequence identity to the 
consensus sequences. The strains NL43, YU2, and JRCSE 
are reconstructed with 97% sequence identity to their re¬ 
spective consensus sequences. While ShoRAH reconstructs 
haplotypes for four strains with 97% sequence identity, 
PredictHaplo reconstructs haplotypes for three strains. The 
method QuasiRecomb ran out of memory on multiple trials 
and was not evaluated. It should be noted that the refer¬ 
ence sequence used for the reference-based methods was 
generated using the assembler for diverse viral populations 
Vicuna (^. Thus, while other methods fail to reconstruct 
all the strains, MLEHaplo generates haplotypes for all the 
HIV-1 strains. 

We also evaluate whether the reconstructed haplotypes 
are representative of the sequenced reads by mapping all 
the A:-mers from the reads to the reconstructed haplotypes. 
The reconstructed haplotypes by ViPRA account for more 
than 98.46% of the error corrected fc-mers and the paths 
generated by MLEHaplo represent 97.48% of the envelope 
region fc-mers. On the other hand, only 95.56% of the 
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Fig. 3. Phylogenentic relationships of the reconstructed haplotypes on simulated reads from HCV E1/E2 genes and env gene from a set 
of 5 HIV-1 strains. Bootstrap neighbor joining trees are shown for the true haplotypes in HCV population and the set of env genes from 5 HIV-1 
strains. The reconstructed haplotypes from MLEHapIo (red M), ShoRAH (blue S), QuasiRecomb (purple Q) and PredictHapIo (dark grey P) that 
have minimum sequence difference to the true haplotypes in a viral population are indicated next to it. On datasets (A) HCV-U and (B) HCV-P, 
reconstructed haplotypes that are within 10 bp of the true sequence are reported. Inset shows the relative distributions of haplotypes in the power 
distributions and the reconstrutions of different methods are replicated on top of the true strains. Reconstruction for strains 3C, 4C, and IOC is 
considered together as they differ by one bp or are identical. (C) For the five HIV-1 strains, the neighbor joining phylogeny is shown along with 
the env-gene of an outgroup HIV-1 subtype A strain. MLEHapIo reconstructs a haplotype with greater than 98% sequence Identity to the known 
consensus sequences for all five strains, while reconstructed haplotypes by ShoRAH and PredictHapIo are also reported. QuasiRecomb ran out of 
memory on five viral mix dataset and was not evaluated. 


envelope region fc-mers are explained by aligning them to 
the known consensus sequences of the five HIV-l strains, 
indicating that the haplotypes reconstructed by MLEHapIo 
are able to capture more variation observed in the reads than 
alignment to consensus sequences while capturing the true 
phylogeny of the five HIV-1 strains. 

4 Discussion 

We have proposed MLEHapIo, a maximum likelihood de 
novo assembly algorithm for viral haplotypes using paired- 
end NGS data. The paired reads are represented in a De 
Bruijn graph and the pairing information of the reads is 
stored as pairs of fc-mers in the set PS. The support found 
in the set PS for pairs of vertices is used to score source- 
sink paths, and the scoring mechanism ensures that the 
paths represent possible viral haplotypes. As reconstructing 
a minimal cover of the graph under paired constraints is 
NP-hard, we have proposed a polynomial time heuristic 
algorithm, ViPRA, that recovers a small fraction of the total 
number of paths in the graph. MLEHapIo then reconstructs 
a maximum likelihood estimate of the viral population 
using the paths recovered by ViPRA based on a generative 
model for sampling paired reads from the viral population. 
MLEHapIo takes about a day and half of running time on a 
single core machine to process half a million reads, and was 
comparable in run-time to other methods. 


We have tested ViPRA and MLEHapIo on simulated 
viral populations that consist of haplotypes arising from 
common ancestors based on a coalescent model. These 
simulations are more realistic models of viral populations 
rather than introducing random mutations uniformly across 
a known virus. MLEHapIo predicts the smallest set of viral 
haplotypes and has the highest recall for the true haplotypes 
when compared to three existing reference based reconstruc¬ 
tion methods. 

We also evaluated ViPRA and MLEHapIo on reads sim¬ 
ulated from HCV E1/E2 genes identified in an infected 
patient and on a dataset of five HTV-1 strains which has been 
used to test haplotype reconstruction methods. MLEHapIo 
reconstructs haplotypes with greater than 99% sequence 
identity to the true haplotypes for HCV E1/E2 genes and 
greater than 97% sequence identity for the five HTV-1 strains. 
The decrease in sequence identity in the case of real HIV-1 
strains is understandable as the reads are generated from 
replicating viruses where one would observe additional 
sequence variation with respect to the reference. This has 
been documented as single nucleotide polymorphisms with 
respect to the reference in the original study | [3^ and 
as an increase in sequence alignment score in a previous 
method (^ . Nevertheless, the haplotypes reconstructed by 
MLEHapIo better explained the observed error corrected 
sequenced data than the consensus sequences of the known 
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strains, suggesting they form a better representation of the 
sequenced data. Overall, MLEHaplo is able to generate 
haplotypes that capture the true phytogeny of the sequenced 
data. 

The usage of De Bruijn graphs for viral haplotype re¬ 
construction has a number of advantages: The fc-mers as 
vertices avoids the costly computations of reads overlaps. 
Also, because the De Bruijn graph construction is de novo, 
it contains all the variation observed in the viral population 
which can be lost by aligning reads to a reference genome. 
Choosing k greater than D, the size of the largest repeat in 
the viral genome, ensures that the De Bruijn graph obtained 
is acyclic. As repeats in RNA viruses are generally short, 
acyclic De Bruijn graphs can be readily obtained for viral 
populations. However, the De Bruijn graph is sensitive to 
the presence of errors, and an efficient error correction 
algorithm is essential. Error correction using the error cor¬ 
rection software BLESS [M] in addition to removal of source- 
sink paths that had low coverage at their ends aided in 
the reconstruction of full length paths from the algorithm 
ViPRA. The concept of removing low coverage paths from 
the graph has been used earlier in whole genome assembler 
SPADES (35) where tips and bubbles are collapsed using 
similar techniques. 

The set PS stores the fc-mer pairs observed in paired 
reads and it can also combine pairing information from 
multiple overlapping paired reads that are within the insert 
size IS. Efficient storage and computation are possible using 
binary representation for A:-mers and parsing the reads in 
multiple iterations p^ , (40) . Additionally, this computation 
needs to be only performed once for a given sequencing 
dataset. The usage of pairing information for reconstruction 
of paths from De Bruijn graphs has been shown to improve 
performance for single genome assembly (35), (tT) . We have 
used a simplistic version where the number of occurrences 
of a pair of fc-mers is the only information required for the 
path scoring. The presence of variable insert size mate-pair 
data or longer length reads can provide additional support 
for fc-mer pairs in the paired set and further improve the 
performance of ViPRA. 

The proposed scoring mechanism for paths in the De 
Bruijn graph ensures that paths corresponding to the true 
viral haplotypes have a high score. Under the assumption of 
sufficient coverage across a viral haplotype, a path in the De 
Bruijn graph corresponding to a true viral haplotype should 
have high paired fc-mer support within the insert size IS 
in set PS and thus will have a high score. Penalizing the 
score for fc-mer pairs that are missing only within the insert 
size length IS is essential in discarding false positive paths, 
as the absence of such fc-mer pairs has a low probability 
under sufficient coverage. On the other hand, we do not 
expect to see fc-mer pairs that are at distances greater than 
insert size due to our sequencing process, and thus such fc- 
mer pairs are not penalized. This ensures that when ViPRA 
retains the M top scoring sub-paths at a vertex these sub¬ 
paths correspond to segments of the true haplotypes and as 
ViPRA propagates to the source vertices of the graph, ViPRA 
reconstructs paths corresponding to the true haplotypes. 

The number of paths generated by ViPRA is dependent 
on the choice of the parameter M. The choice of M is 
dependent on the inherent diversity of the viral population. 


which can be inferred by the number of vertices in the De 
Bruijn graph. The relative ratio of the counts of occurrences 
of most frequent fc-mers to the average sequencing coverage 
can be used as a lower bound for determining M. As the 
most frequent fc-mers are likely to be shared amongst the 
viral haplotypes in the population, such an M is the mini¬ 
mum that is required for ViPRA to reconstruct all viral hap¬ 
lotypes that share this fc-mer's vertex. We experimentally 
determined that M = 10 was sufficient for reconstructing 
all the true viral haplotypes in the simulated datasets and 
real datasets, even in the presence of sequencing errors. 

MLEHaplo generates a maximum likelihood set of paths 
using the paths generated by ViPRA as a starting point. 
As the number of viral haplotypes in the population is 
unknown, MLEHaplo uses a backward elimination opti¬ 
mization that iteratively removes a path from the set of 
paths reconstructed by ViPRA such that the likelihood of 
the sampled paired reads under the remaining set of paths 
increases. Thus, MLEHaplo further reduces the number 
of paths reconstructed from the graph thereby reducing 
the false positive paths at convergence. The advantage of 
MLEHaplo is that all the sampled paired reads are explained 
by one of the reconstructed paths at convergence. However, 
it is sensitive to the paths reconstructed by ViPRA and 
can only improve on this set of paths. Additionally, as 
a path removed by backward elimination once is never 
considered again, it leads to the retention of haplotypes 
that are close to the true sequences but do not affect the 
likelihood. Removing one path iteratively makes MLEHaplo 
a time consuming step, and parallelization of the code 
can improve the speed of computation. An advantage of 
MLEHaplo and ViPRA over the existing methods is that 
they retain the correct phytogeny of the true viral haplo¬ 
types, even when the reconstructed paths are not an exact 
match to the true viral haplotypes. Thus, the viral diver¬ 
sity can be correctly inferred from the reconstructed paths. 
The software implementation for MLEHaplo is available at 
https:/ / github.com/raunaq-m/MLEHaplo 

5 SUPPLEMENTARY FIGURES 

51 Figure 

Likelihoods of a set of haplotypes reconstructed by MLE¬ 
Haplo. x-axis is the number of haplotypes present in the 
predicted set, y-axis shows the maximum likelihood for any 
combination of predicted haplotypes of the size mentioned 
on x-axis. The vertical bars denote 95% confidence intervals. 
The likelihood drops significantly when fc-mer-pairs that are 
present in the reads are not explained by a current size of 
haplotypes. Inset shows where the maximum likelihood was 
achieved in datasets D1 and D8. 

52 Figure 

Phylogenetic tree of true and predicted haplotypes for 
datasets Dl, D2, D6, and D8. Bootstrap neighbor joining 
tree for the predicted (green) and true haplotypes (red) 
for datasets Dl, D2, D6, and D8. Solid triangles indicate 
perfect sequence match between a reconstructed path and 
a true haplotype that clusters with it, while hollow green 
squares indicate predicted haplotypes that have mismatches 
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to the true haplotypes (hollow red circles). Except for Dl, 
D2, D6, and D8, all datasets have a reconstructed path 
exactly matching each of the seven true haplotypes. D6 has 
more than one reconstructed paths for three out of seven 
true haplotypes, nevertheless, they all group together in the 
neighbor joining tree. 


S3 Figure 

Phylogenetic tree of true and predicted haplotypes for 
datasets D3-D5, D7, and D9-D10. Bootstrap neighbor 
joining trees for the predicted (green) and true haplotypes 
(red) for datasets D3, D4, D5, D7, D9 and DIO. Solid triangles 
indicate perfect sequence match between the reconstructed 
path and a true haplotype that clusters with it. All six 
datasets have perfect reconstruction. 


S4 Figure 

ViPRA: Effect of variation in coverage in length of re¬ 
covered paths for small genome size viral populations. 

Boxplot of length of paths recovered for datasets D6 (first 
column) and DIO (second column) at coverages of 25x, 50x, 
lOOx, 200x, and 400x. The length of paths are normalized 
genome size (1200bp). Partial length paths are recovered at 
low coverages. Full length paths are recovered at coverages 
greater than lOOx. 


S5 Figure 

Distribution of variable and conserved sites for simulated 
viral population and HCV strains. Figure shows the dis¬ 
tribution of variable sites (black lines) and conserved sites 
(gray background) across an alignment of 7 haplotypes in 
Dl dataset (top part), and across an alignment of ten E1/E2 
HCV strains. The distribution for Dl is uniform, while it is 
highly non-uniform for HCV strains. 


56 Figure 

Distance matrix for the simulated viral populations and 
HCV E1/E2 strains. Figure shows the sequence distance dis¬ 
tribution of the viral haplotypes used in simulated datasets 
under the coalescent model and the ten haplotypes in the 
HCV viral population. (A) Dataset D6, (B) Dataset DIO, and 
(C) HCV dataset. 

57 Figure 

Pairwise distance of the predicted haplotypes from 
MLEHaplo for E1/E2 HCV strains. The pairwise dis¬ 
tance heatmaps for the predicted haplotypes by MFEHaplo 
are shown for (A) 43 haplotypes reconstructed in HCV- 
U dataset, and (B) 81 haplotypes reconstructed in HCV- 
P dataset. The blocks observed in the heatmaps indicate 
predicted haplotypes that have similar to each other. The 
colorbar indicates the scale for the pairwise distances. The 
haplotypes are clustered into 27 distinct paths in HCV-U 
and 28 distinct paths in HCV-P dataset. 


58 Figure 

Comparison of pairwise distances for reconstructed hap¬ 
lotypes for E1/E2 HCV strains. (A) The histograms for the 
pairwise distances between the 10 true E1/E2 HCV strains 
is shown. (B) The histogram of the pairwise distances of 
the reconstructed haplotypes by each method in the HCV- 
U dataset (left column) and HCV-P dataset (right column) 
are shown. Haplotypes reconstructed by QuasiRecomb have 
low pairwise distances compared to the true haplotypes, 
while MFEHaplo, ShoRAH resemble the distribution for the 
true haplotypes. 

59 Figure 

Alignment of reconstructed haplotypes from MLEHaplo 
for five HIV clones dataset. The reconstructed haplotypes 
are 1500 bp long, but cover the envelope gene of all the five 
HIV strains with overlapping segments, which can be easily 
joined together. 

6 Tables 


TABLE 1 

Comparison of total number of paths to paths generated by ViPRA 

at M = 10 


Dataset 

Total paths in 
the graph 

# of paths 
from 
ViPRA 

MLEHaplo: 

# of paths 
in H„, b 

Dl 

4824 

50(7) 

7(6) 

D2 

44299 

38(7) 

8(6) 

D3 

325585 

42(7) 

7(7) 

D4 

164387 

52(7) 

7(7) 

D5 

6768 

73(7) 

7(7) 

D6 

1665626 

32(7) 

10(6) 

D7 

2423 

66(7) 

7(7) 

D8 

8712 

74(7) 

8(6) 

D9 

4895 

75(7) 

7(7) 

DIO 

1357 

85(7) 

7(7) 


^ The score of each path P is non-negative. 

^ Number in bracket indicates the # of true haplotypes in the 
population that are retained by ViPRA or MLEHaplo with an 
exact sequence match. There are seven true viral haplotypes 
in each dataset. 


Appendix A 

Rationale that scoring mechanism selects 

PATHS THAT CORRESPOND TO VIRAL HAPLOTYPES 
IN THE POPULATION 

The fc-mer pairs in the set PS is a summarization of the 
observed paired reads and as the scoring of paths is based 
on the set PS, it is useful in selecting source-sink paths that 
represent the viral population. As defined in the text, the 
score S{P) of a path P is defined as : 


S{P) 


1 

W)' 


{l[ir,s)€PS] 

{r,s)GPn [d{s) — d(r)<IS] 


pen ■ l[(r, s) ^ PS]} 


( 5 ) 


We provide a rationale for the paths corresponding to 
true viral haplotypes to have a high score S{P). 




TABLE 2 

Comparison of results on Dl-DlO to existing methods 
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Dataset 

MLEHaplo 

ShoRAH 

QuasiRecomb ® 

PredictHaplo “ 

D1 

7(7) 

36(3) 

6837 (2) 

4(0) 

D2 

8(6) 

56(1) 

9864 (0) 

- 

D3 

7(7) 

49(1) 

1964 (5) 

5(4) 

D4 

7(7) 

39(0) 

8582 (0) 

6(1) 

D5 

7(7) 

- 

9988 (0) 

3(0) 

D6 

10(6) 

17(4) 

4265 (3) 

7(2) 

D7 

7(7) 

8(2) 

7161 (1) 

7(6) 

D8 

8(6) 

29(1) 

9943 (1) 

3(0) 

D9 

7(7) 

14(1) 

8386 (2) 

5(1) 

DIO 

7(7) 

19(1) 

9239 (0) 

4(0) 


* The number in the bracket indicates the number of true haplotypes that 
are present in the predicted set with an exact match. 

^ Consensus sequence generated by VICUNA was used as reference 
sequence for reference based methods SHoRAH, QuasiRecomb, and Pre- 
dictHaplo. 


TABLE 3 

Comparison of number of paths generated on HCV populations. 


Dataset 

MLEHaplo 

ShoRAH 

QuasiRecomb 

PredictHaplo 

HCV-U 

27(43) 

27(280) 

45(1024) 

7(7) 

HCV-U(Recall) 

87.5% 

75% 

62.5% 

50% 

HCV-P 

28(81) 

18(41) 

8(493) 

7(8) 

HCV-P(Recall) “ 

87.5% 

50% 

37.5% 

37.5% 


* The number in the bracket indicates the # of paths predicted by a method, 
while the value outside denotes the number of clusters of paths that are differ 
by greater than 99% sequence difference. 

“ Recall is the fraction of the true E1/E2 HCV strains (out of 8) that are recovered 
with more than 99% sequence identity (less than 10 bp difference) by a method. 


Definition 1 (Sufficient Coverage). The sampled paired reads 
are defined to have sufficient coverage of the viral population H 
if there exists paired reads (Rf,Rr) that sample every haplotype 
i/ G H and there exists a paired read {R'j,R'^) that samples a 
pair ofk-mers {ui, Uj) G H with d{ui, uj) < IS. 

Given sufficient coverage and the definition of the paired set 
PS, if two vertices Ui and Uj are present in a paired read, 
then a path P that contains both vertices Ui and Uj can be 
a possible viral haplotype. Using sufficient coverage, we can 
show that paths corresponding to the true viral haplotypes 
in H have a path score S{P) = 1. Thus in order to extract 
paths from the graph G, it is sufficient to focus on the paths 
with high scores. 

Theorem 1. Given sufficient coverage of the viral population H, 
for a path P in the graph G that corresponds to a viral haplotype 
in H, the score S{P) = 1. 

Proof. The proof can be broken into two parts: 

1. For every viral haplotype Hi G H, there exists a path 
P in the graph G, and 

2. The score S{P) for such paths is 1. 

As sufficient coverage implies that all the viral haplotypes 
Hi G a are sampled by the paired reads, for a given viral 
haplotype H, it follows that there exists vertices Uj in the 
graph G that are sampled from the viral haplotype H, which 
implies that a path P = {ui,U 2 , ..., Um) , where Ui G H, 
exists in the graph. 

The score S{P) for such a path, by definition in equation 

t is the summation over all pairs of vertices {ui,Uj) G P 
at are within the distance IS. Again, by the definition of 


sufficient coverage, all pairs of vertices from a viral haplo¬ 
type H G H within distance IS are sampled by some paired 
read, which implies that {ui, Uj) G PS. Thus the score S{P) 
is : 


S{P) = 


E{P) 


E 


1 [(r, s) G PS”] — pen ■ 1 [(r, s) ^ PS] 


{r,s)^Pnd{s,r)<IS 


1 

W) 


E 


{r,s)^Pnd{s,r)<IS 


l[{r,s)GPS] = ^yE{P) = l 


( 6 ) 


□ 


Appendix B 

Viral Path Reconstruction Algorithm 
(V iPRA) 

Algorithmj^describes the pseudo-code of ViPRA that recon¬ 
structs the top M paths through every vertex in the graph. 
ViPRA starts by initializing the paths from all vertices u to 
the sink vertex Usink to empty sets. The scores for all paths 
from each vertex are also initialized to empty sets (Lines 
1-4). Next it iterates over all vertices u G 14 in increasing 
order of their distance to the sink vertex {d{u,Usink)) to 
compute the top M— paths through each vertex u using the 
function TOP-M-PATHS-FOR-VERTEX(u, E^, PS) (Lines 5- 
10). When a graph has multiple sink vertices, a universal 
sink vertex is defined that has an edge from each of the sink 
vertex to it. 

Algorithm describes the memoized algorithm that 
computes the top M— paths from a vertex u to the 
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Algorithm 1 ViPRA() : Top M paths per vertex based on set 
PS and the graph G 

Input: Directed De Bruijn graph G{V,E), Set PS, d(.) the 
distance between vertex pairs in G. 

Output: Paths(u) = {Pui,Pu 2 , ■ • ■, Pum} Vit € F 
Score{u) = Su 2 , • • ■, Sum} Vm G F 

1: for each vertex u G V do 

2: Paths{u) = [0] //Initialize the paths for each vertex 

3: Score{u) = [0] 

4: end for 

5 : Score{usink) = { 0 } 

6: Paths{usink) = // Place holder between vertices 

7: N — SORT-INCREASING(F, D(.)) // Sort vertices in 
increasing order of distance to the sink vertex and store 
itin 

8: for * = 1, • • • , |A^| do 

9: [Paths{N[i],Score{N[i\))] =TOP-M-PATHS-FOR- 

yEKTEX{N[i\,E,PS) 

10: end for 


sink vertex Us^uk (TOP - M - PATHS - FOR - 
VERTEX{u,E,PS)). It first recovers the top M— paths 
from all the neighbors of u having an incoming edge from 
u into the array TP and their scores in SP (Lines 1-6). 
As the distance d{u,Usink) is greater than the distance of 
its neighboring vertex s, {d{u,Usink) > d{s,Usink)), the 
arrays Paths{s) and Score{s) have already been computed 
(Algorithm]^ in line 9). The score for each of the path in TP 
is updated when adding the vertex u to the path. The path 
scores are updated taking into account memberships in the 
set PS (Lines 7-20). The penalty term (pen) is proportional 
to the length of the path T (Line 16). The paths stored in the 
array TP are sorted based on their updated scores (Line 21). 
The first M—paths in the array TP and their scores SP are 
stored as the paths through vertex u to the sink vertex Usink 
(Lines 22-26). 

B. 0.1 Time complexity analysis of ViPRA 
ViPRA runs Algorithm as its sub-routine and the 
time complexity of the two algorithms is evaluated to¬ 
gether. Lines 1-7 of ViPRA (Algorithm takes 0(|Fc| -t- 
|Fc|log|Fc|) time to initialize and then sort the vertices. 
The \Vc\ calls to the sub-routine TOP-M-PATHS-FOR- 
VLRTLX(A[i], Ec, PS) take 0{M ■ \Ec\) time (Lines 8-10 of 
ViPRA, Lines 1-6 of Algorithm |^. Lach edge (tt, v) G E^ is 
encountered at most M times to store the top M—paths for 
the vertex u. Lines 7-20 in Algorithm look at each path 
T in array TP and query for vertex-pairs {u,w) G PS. 
The number of queries are proportional to the length of 
the path T. Thus, the total number of queries is bounded 
by 0(|TP| • |r|). As the length of path increases linearly to 
the maximum depth in the graph G (0(|Fc|), and the size 
of I TP I is bounded by 0(M), the total time complexity for 
lines 7-20 is 0{M ■ |Fcp). As lines 21-26 perform a sorting 
operation for |TP| elements at a time, its time complexity is 
bounded by 0{M ■ \Ec \ log (L ■ |Pc|)) - Thus overall running 
timeofViPRAisCi(M-|KP+M-|Pc|log(M • |Pe|))-Notice 
that the running time of ViPRA increases log linearly with 


Algorithm 2 TOP-M-PATHS-FOR-VERTLX (u, P, PS) : Top 
M—paths for a vertex u in the graph G 
Input: Vertex u, Edge set E, Set PS, Insert size IS 
Output: Paths{u), Score{u), Set of top 

M—paths for the vertex u, and their respective 
scores 
1: TP = [0] 

2: SP = [0] 

3: for each {u, (u, v) G E} do 

4: TP = JOIN (TP, Paths{v)) //Obtain top M -paths 

of the neighbor 

5: SP = JOIN (SPScoreiv)) 

6: end for 

7: for each path T G TP do 
8: I = length(T) 

9: SP{T) = SP{T) ■ 

10: for each vertex w G T do 

11: if (u, w) G PS then 

12: S'P(T) = SP{T) + 1 

13: else if D{w) — D{u) > IS* then 

14: SP(T) = SP(T) + 1 

15: else 

16: SP{T) = SP(T) - I 

17: end if 

18: end for 

19: SP{T) = SP{T) ■ ^ 

20: end for 

21: (TP,SP) = SORT-DECREASING(TP,S'P(.)) // Sort 
TP paths based on the corresponding scores S'P of the 
paths 

22: for I = 1 ... M do 

23: Paths(u) = JOIN ({M“_"TP[i]},Pof/is(u)) // Add 

u to the path 

24: Score{u) =JOIN ({S'P[i]}, S'core(u)) 

25: end for 

26: Return Paths{u), Score{u) 


the parameter M of the algorithm, and is quadratic in the 
input graph parameters, namely Gc{Vc, Ec). 
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